In [1]:
# Import Libraries

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(color_codes=True)

from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import dendrogram, linkage, cophenet, fcluster
from scipy.spatial.distance import cdist, pdist

import plotly.express as px

Load Data

In [2]:
cardData = pd.read_excel('Credit Card Customer Data.xlsx')
cardData.head()
Out[2]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3

Deliverable – 1

Univariate analysis

In [3]:
# Shape of data
cardData.shape
Out[3]:
(660, 7)
In [4]:
rowCount = cardData.shape[0]
rowCount
Out[4]:
660
In [5]:
# Data Types
cardData.dtypes
Out[5]:
Sl_No                  int64
Customer Key           int64
Avg_Credit_Limit       int64
Total_Credit_Cards     int64
Total_visits_bank      int64
Total_visits_online    int64
Total_calls_made       int64
dtype: object
In [6]:
# Null check
cardData.isnull().sum()
Out[6]:
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
In [7]:
# number of unique values
cardData.nunique()
Out[7]:
Sl_No                  660
Customer Key           655
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
In [8]:
# Unique values & counts i.e. value counts

for col in cardData.columns:
    print('----------- Value Counts of ', col, ' -----------')
    print(cardData.groupby(by=col)[col].value_counts())
    print('-------------------------------------')
    print()
    
----------- Value Counts of  Sl_No  -----------
Sl_No  Sl_No
1      1        1
2      2        1
3      3        1
4      4        1
5      5        1
               ..
656    656      1
657    657      1
658    658      1
659    659      1
660    660      1
Name: Sl_No, Length: 660, dtype: int64
-------------------------------------

----------- Value Counts of  Customer Key  -----------
Customer Key  Customer Key
11265         11265           1
11398         11398           1
11412         11412           1
11466         11466           1
11562         11562           1
                             ..
99437         99437           1
99473         99473           1
99589         99589           1
99596         99596           1
99843         99843           1
Name: Customer Key, Length: 655, dtype: int64
-------------------------------------

----------- Value Counts of  Avg_Credit_Limit  -----------
Avg_Credit_Limit  Avg_Credit_Limit
3000              3000                 1
5000              5000                21
6000              6000                31
7000              7000                24
8000              8000                35
                                      ..
184000            184000               1
186000            186000               1
187000            187000               1
195000            195000               2
200000            200000               1
Name: Avg_Credit_Limit, Length: 110, dtype: int64
-------------------------------------

----------- Value Counts of  Total_Credit_Cards  -----------
Total_Credit_Cards  Total_Credit_Cards
1                   1                      59
2                   2                      64
3                   3                      53
4                   4                     151
5                   5                      74
6                   6                     117
7                   7                     101
8                   8                      11
9                   9                      11
10                  10                     19
Name: Total_Credit_Cards, dtype: int64
-------------------------------------

----------- Value Counts of  Total_visits_bank  -----------
Total_visits_bank  Total_visits_bank
0                  0                    100
1                  1                    112
2                  2                    158
3                  3                    100
4                  4                     92
5                  5                     98
Name: Total_visits_bank, dtype: int64
-------------------------------------

----------- Value Counts of  Total_visits_online  -----------
Total_visits_online  Total_visits_online
0                    0                      144
1                    1                      109
2                    2                      189
3                    3                       44
4                    4                       69
5                    5                       54
6                    6                        1
7                    7                        7
8                    8                        6
9                    9                        4
10                   10                       6
11                   11                       5
12                   12                       6
13                   13                       5
14                   14                       1
15                   15                      10
Name: Total_visits_online, dtype: int64
-------------------------------------

----------- Value Counts of  Total_calls_made  -----------
Total_calls_made  Total_calls_made
0                 0                    97
1                 1                    90
2                 2                    91
3                 3                    83
4                 4                   108
5                 5                    29
6                 6                    39
7                 7                    35
8                 8                    30
9                 9                    32
10                10                   26
Name: Total_calls_made, dtype: int64
-------------------------------------

In [9]:
# Statistical summary of numeric columns
cardData.describe().T
Out[9]:
count mean std min 25% 50% 75% max
Sl_No 660.0 330.500000 190.669872 1.0 165.75 330.5 495.25 660.0
Customer Key 660.0 55141.443939 25627.772200 11265.0 33825.25 53874.5 77202.50 99843.0
Avg_Credit_Limit 660.0 34574.242424 37625.487804 3000.0 10000.00 18000.0 48000.00 200000.0
Total_Credit_Cards 660.0 4.706061 2.167835 1.0 3.00 5.0 6.00 10.0
Total_visits_bank 660.0 2.403030 1.631813 0.0 1.00 2.0 4.00 5.0
Total_visits_online 660.0 2.606061 2.935724 0.0 1.00 2.0 4.00 15.0
Total_calls_made 660.0 3.583333 2.865317 0.0 1.00 3.0 5.00 10.0
In [10]:
def getQuartileCounts(col):
    print('Quartile Analysis of ', col)
    print()
    Min = cardData[col].min()
    Max = cardData[col].max()
    Std = cardData[col].std()
    Mean = cardData[col].mean()
    Median = cardData[col].median()
    Q1 = cardData[col].quantile(q=0.25)
    Q2 = cardData[col].quantile(q=0.50)
    Q3 = cardData[col].quantile(q=0.75)
    IQR = Q3-Q1
    Minimum = Q1 - (1.5 * IQR)
    Maximum = Q3 + (1.5 * IQR)
    LeftOutlierCount = cardData[cardData[col]<Minimum].shape[0]
    Q1Count = cardData[(cardData[col]>=Minimum) & (cardData[col]<=Q1)].shape[0]
    Q2Count = cardData[(cardData[col]>Q1) & (cardData[col]<=Q2)].shape[0]
    Q3Count = cardData[(cardData[col]>Q2) & (cardData[col]<=Q3)].shape[0]
    Q4Count = cardData[(cardData[col]>Q3) & (cardData[col]<=Maximum)].shape[0]
    RightOutlierCount = cardData[cardData[col]>Maximum].shape[0]
    print('Min ', Min, " Value count: ", cardData[cardData[col] == Min].shape[0])
    print('Max ', Max, " Value count: ", cardData[cardData[col] == Max].shape[0])
    print()
    print('Standard Deviation ', Std)
    print('Mean ', Mean)
    print('Median ', Median)
    print()
    print('25th percentile Q1 ', Q1)
    print('50th percentile Q2 ', Q2)
    print('75th percentile Q3 ', Q3)
    print('IQR ', IQR)
    print('Minimum = Q1 - 1.5*IQR = ', Minimum)
    print('Maximum = Q3 + 1.5*IQR = ', Maximum)
    print()
    print('Left outlier count i.e. < Minimum  ', LeftOutlierCount)
    print('>= Minimum and <= Q1 count         ', Q1Count)
    print('> Q1 and <= Q2 count               ', Q2Count)
    print('> Q2 and <= Q3 count               ', Q3Count)
    print('> Q3 and <= Maximum                ', Q4Count)
    print('Right outlier count i.e. > Maximum ', RightOutlierCount)
    print('Total Outliers ', (LeftOutlierCount+RightOutlierCount), '  ', (LeftOutlierCount+RightOutlierCount)*100/rowCount, '%', ' of total records')
    print()
    if(Mean > Median):
        print('Distribution is Right Skewed because Mean > Median')
    elif(Mean < Median):
        print('Distribution is Left Skewed because Mean < Median')
    else:
        print('Distribution is Symmetric because Mean = Median')
        
        
def plotUnivariate(col, pltType):
    if (pltType == 'box-dist'):
        fig, axes = plt.subplots(1, 2, figsize=[15,5])
        fig.tight_layout(pad=5.0)
        getQuartileCounts(col)
        sns.boxplot(data=cardData, x=col, ax=axes[0]);
        sns.histplot(cardData[col], kde=True, ax=axes[1]);  # distplot is deprecated in seaborn
In [11]:
# Plot for Avg_Credit_Limit
plotUnivariate('Avg_Credit_Limit', 'box-dist')
Quartile Analysis of  Avg_Credit_Limit

Min  3000  Value count:  1
Max  200000  Value count:  1

Standard Deviation  37625.48780422165
Mean  34574.242424242424
Median  18000.0

25th percentile Q1  10000.0
50th percentile Q2  18000.0
75th percentile Q3  48000.0
IQR  38000.0
Minimum = Q1 - 1.5*IQR =  -47000.0
Maximum = Q3 + 1.5*IQR =  105000.0

Left outlier count i.e. < Minimum   0
>= Minimum and <= Q1 count          166
> Q1 and <= Q2 count                178
> Q2 and <= Q3 count                152
> Q3 and <= Maximum                 125
Right outlier count i.e. > Maximum  39
Total Outliers  39    5.909090909090909 %  of total records

Distribution is Right Skewed because Mean > Median

Looking at the number of peaks in the distribution, 3 clusters could be formed based on Average Credit Limit.

In [12]:
# Plot for Total_Credit_Cards
plotUnivariate('Total_Credit_Cards', 'box-dist')
Quartile Analysis of  Total_Credit_Cards

Min  1  Value count:  59
Max  10  Value count:  19

Standard Deviation  2.1678348595111934
Mean  4.706060606060606
Median  5.0

25th percentile Q1  3.0
50th percentile Q2  5.0
75th percentile Q3  6.0
IQR  3.0
Minimum = Q1 - 1.5*IQR =  -1.5
Maximum = Q3 + 1.5*IQR =  10.5

Left outlier count i.e. < Minimum   0
>= Minimum and <= Q1 count          176
> Q1 and <= Q2 count                225
> Q2 and <= Q3 count                117
> Q3 and <= Maximum                 142
Right outlier count i.e. > Maximum  0
Total Outliers  0    0.0 %  of total records

Distribution is Left Skewed because Mean < Median

Looking at the number of peaks in the distribution, 5 clusters could be formed based on Total Credit Cards.

In [13]:
# Plot for Total_visits_bank
plotUnivariate('Total_visits_bank', 'box-dist')
Quartile Analysis of  Total_visits_bank

Min  0  Value count:  100
Max  5  Value count:  98

Standard Deviation  1.6318128757916113
Mean  2.403030303030303
Median  2.0

25th percentile Q1  1.0
50th percentile Q2  2.0
75th percentile Q3  4.0
IQR  3.0
Minimum = Q1 - 1.5*IQR =  -3.5
Maximum = Q3 + 1.5*IQR =  8.5

Left outlier count i.e. < Minimum   0
>= Minimum and <= Q1 count          212
> Q1 and <= Q2 count                158
> Q2 and <= Q3 count                192
> Q3 and <= Maximum                 98
Right outlier count i.e. > Maximum  0
Total Outliers  0    0.0 %  of total records

Distribution is Right Skewed because Mean > Median

Looking at the number of peaks in the distribution, 3 clusters could be formed based on Total Bank Visits.

In [14]:
# Plot for Total_visits_online
plotUnivariate('Total_visits_online', 'box-dist')
Quartile Analysis of  Total_visits_online

Min  0  Value count:  144
Max  15  Value count:  10

Standard Deviation  2.9357241204935414
Mean  2.606060606060606
Median  2.0

25th percentile Q1  1.0
50th percentile Q2  2.0
75th percentile Q3  4.0
IQR  3.0
Minimum = Q1 - 1.5*IQR =  -3.5
Maximum = Q3 + 1.5*IQR =  8.5

Left outlier count i.e. < Minimum   0
>= Minimum and <= Q1 count          253
> Q1 and <= Q2 count                189
> Q2 and <= Q3 count                113
> Q3 and <= Maximum                 68
Right outlier count i.e. > Maximum  37
Total Outliers  37    5.606060606060606 %  of total records

Distribution is Right Skewed because Mean > Median

Looking at the number of peaks in the distribution, 4 clusters could be formed based on Total Online Visits.

In [15]:
# Plot for Total_calls_made
plotUnivariate('Total_calls_made', 'box-dist')
Quartile Analysis of  Total_calls_made

Min  0  Value count:  97
Max  10  Value count:  26

Standard Deviation  2.865316817622708
Mean  3.5833333333333335
Median  3.0

25th percentile Q1  1.0
50th percentile Q2  3.0
75th percentile Q3  5.0
IQR  4.0
Minimum = Q1 - 1.5*IQR =  -5.0
Maximum = Q3 + 1.5*IQR =  11.0

Left outlier count i.e. < Minimum   0
>= Minimum and <= Q1 count          187
> Q1 and <= Q2 count                174
> Q2 and <= Q3 count                137
> Q3 and <= Maximum                 162
Right outlier count i.e. > Maximum  0
Total Outliers  0    0.0 %  of total records

Distribution is Right Skewed because Mean > Median

Looking at the number of peaks in the distribution, 2 clusters could be formed based on Total Calls Made.
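The peak counts in the univariate notes above were read off the plots visually. As a rough cross-check, the peaks of a kernel density estimate can be counted programmatically; a sketch on synthetic bimodal data (the `count_kde_peaks` helper is illustrative, not part of the notebook):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def count_kde_peaks(values, grid_size=200):
    """Estimate the density on a grid and count its local maxima."""
    kde = gaussian_kde(values)
    grid = np.linspace(np.min(values), np.max(values), grid_size)
    density = kde(grid)
    peaks, _ = find_peaks(density)
    return len(peaks)

# Hypothetical data: two well-separated modes, so we expect two peaks
rng = np.random.default_rng(7)
bimodal = np.concatenate([rng.normal(0, 1, 300), rng.normal(8, 1, 300)])
print(count_kde_peaks(bimodal))
```

The same helper could be applied to each column of `cardData` to make the "number of peaks" observations reproducible rather than eyeballed.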

In [16]:
# Let's drop the Serial Number and Customer Key columns (because these are Identifier columns) 
#    and create new dataset for analysis.

cardDataTrimmed = cardData.drop(columns=['Sl_No', 'Customer Key'])
In [17]:
# Correlation between the columns

cardDataTrimmed.corr()
Out[17]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Avg_Credit_Limit 1.000000 0.608860 -0.100312 0.551385 -0.414352
Total_Credit_Cards 0.608860 1.000000 0.315796 0.167758 -0.651251
Total_visits_bank -0.100312 0.315796 1.000000 -0.551861 -0.506016
Total_visits_online 0.551385 0.167758 -0.551861 1.000000 0.127299
Total_calls_made -0.414352 -0.651251 -0.506016 0.127299 1.000000
In [18]:
# Pair plot
sns.pairplot(cardDataTrimmed, diag_kind='kde');

Observations

Weak correlations (absolute value below 0.3) are ignored here.

  • Total Credit Cards and Average Credit Limit are positively correlated. Degree of Correlation 0.608860
  • Total Credit Cards and Total Bank Visits are positively correlated. Degree of Correlation 0.315796
  • Total Visits Online and Average Credit Limit are positively correlated. Degree of Correlation 0.551385
  • Total Visits Online and Total Bank Visits are negatively correlated. Degree of Correlation -0.551861
  • Total Calls Made and Average Credit Limit are negatively correlated. Degree of Correlation -0.414352
  • Total Calls Made and Total Credit Cards are negatively correlated. Degree of Correlation -0.651251
  • Total Calls Made and Total Bank Visits are negatively correlated. Degree of Correlation -0.506016

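The bullet list above was assembled by reading the correlation matrix manually; the same strong pairs can be extracted programmatically. A minimal sketch (the `strong_pairs` helper and the tiny frame are illustrative, not part of the notebook):

```python
import pandas as pd

def strong_pairs(corr, threshold=0.3):
    """Return (col_a, col_b, r) for each unique pair with |r| >= threshold,
    strongest first."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 6)))
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Tiny illustrative frame (hypothetical data, not the card dataset)
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with a
    'c': [5, 3, 4, 1, 2],    # strongly negative vs a and b
    'd': [1, -1, 1, -1, 1],  # uncorrelated with a and b
})
print(strong_pairs(df.corr()))
```

In the notebook the call would be `strong_pairs(cardDataTrimmed.corr())`, which reproduces the seven bullets above.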
Create new feature based on various types of contact and visualize in 3D

In [19]:
# Let's create a new feature based on different types of contacts
# This feature will help in segmentation and visualization

cardDataTotalContactsCopy = cardDataTrimmed.copy()

cardDataTotalContactsCopy["Total_Contacts"] = (cardDataTotalContactsCopy["Total_visits_bank"] 
                                               + cardDataTotalContactsCopy["Total_visits_online"] 
                                                + cardDataTotalContactsCopy["Total_calls_made"])
cardDataTotalContactsCopy.head()
Out[19]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Total_Contacts
0 100000 2 1 1 0 2
1 50000 3 0 10 9 19
2 50000 7 1 3 4 8
3 30000 5 1 1 4 6
4 100000 6 0 12 3 15
In [20]:
# Scatter 3D Plot
px.scatter_3d(cardDataTotalContactsCopy, x='Avg_Credit_Limit', y='Total_Credit_Cards', z='Total_Contacts', 
              color='Total_Contacts')

Deliverable – 2

K-means Clustering

In [21]:
random = 7
In [22]:
# Data Preprocessing - Scaling

scaler = preprocessing.StandardScaler()
cardDataKMeans = pd.DataFrame(scaler.fit_transform(cardDataTrimmed), columns=cardDataTrimmed.columns)
cardDataKMeans.head()
Out[22]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1.740187 -1.249225 -0.860451 -0.547490 -1.251537
1 0.410293 -0.787585 -1.473731 2.520519 1.891859
2 0.410293 1.058973 -0.860451 0.134290 0.145528
3 -0.121665 0.135694 -0.860451 -0.547490 0.145528
4 1.740187 0.597334 -1.473731 3.202298 -0.203739
In [23]:
# K-means - Let's find the k using Average Distortions

distortions = []
cardDataDistortionsCopy = cardDataKMeans.copy()

kMeansClusters = range(1, 15)

for k in kMeansClusters:
    model=KMeans(n_clusters=k, random_state=random)
    model.fit(cardDataDistortionsCopy)
    prediction=model.predict(cardDataDistortionsCopy)
    distances = cdist(cardDataDistortionsCopy, model.cluster_centers_, 'euclidean')
    minDistance = np.min(distances, axis=1)
    averageMinDistance = sum(minDistance)/rowCount
    distortions.append(averageMinDistance)

print('Average Distortions')    
print()
print(distortions)

fig = plt.figure(figsize=[20,5])
plt.plot(kMeansClusters, distortions, 'bo-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method - Average Distortions');
Average Distortions

[2.006922226250361, 1.4571553548514269, 1.1466276549150365, 1.0463825294774463, 0.9907383936093273, 0.9429753673519395, 0.9094176768725913, 0.8899781320923997, 0.8673662789777814, 0.8451905732591566, 0.8234503501657384, 0.8005954552273012, 0.7828791795225496, 0.7654781286091837]

From the elbow graph above, either 4 or 5 clusters can be formed.
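The loop above computes the average distortion via `cdist`; scikit-learn also exposes the summed within-cluster distortion directly as `inertia_`, which produces the same elbow shape. A minimal sketch on synthetic blobs (illustrative data, not the card dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three well-separated 2-D blobs
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])

# KMeans stores the within-cluster sum of squared distances as inertia_
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(data)
    inertias.append(km.inertia_)

# The "elbow": successive drops shrink sharply once k passes the true cluster count
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print([round(v, 1) for v in inertias])
```

The drop from k=2 to k=3 dwarfs the later drops, which is exactly the elbow being read off the distortion plot above.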

In [24]:
# K-means - Let's find the k using Silhouette Score

silhouette = []
cardDataSilhouetteCopy = cardDataKMeans.copy()

kMeansClusters = range(2, 15)

for k in kMeansClusters:
    model=KMeans(n_clusters=k, random_state=random)
    model.fit(cardDataSilhouetteCopy)
    prediction=model.predict(cardDataSilhouetteCopy)
    silhouette.append(silhouette_score(cardDataSilhouetteCopy, prediction))

print('Silhouette Score')    
print()
print(silhouette)


plt.plot(kMeansClusters, silhouette, 'bo-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Selecting k with the Elbow Method - Silhouette Score');
Silhouette Score

[0.41842496663215445, 0.5157182558881063, 0.3556670619372605, 0.2723911235693706, 0.25602552465167167, 0.24867461158845058, 0.2281464345462611, 0.21463208457611233, 0.2101016333349411, 0.2095705975616728, 0.20835374489226643, 0.21206042233459693, 0.21416985641015562]

From the silhouette scores above, 5 or 6 clusters can be considered.
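As a sanity check of this approach, on synthetic data with a known number of blobs the silhouette score peaks at the true k. A small sketch (illustrative data, not the card dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs, so the true k is 3
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k)
```

On real data such as this card dataset the peak is less clean, which is why the elbow and silhouette evidence are weighed together below.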

K-decision

The univariate analysis of Total Credit Cards also suggests that 5 clusters can be formed.

Elbow graphs based on average distortion and the silhouette score show that 5 clusters can be formed.

Let's form 5 clusters and visualize

In [25]:
cardDataFinalModelCopy = cardDataKMeans.copy()

model=KMeans(n_clusters=5, random_state=random)
model.fit(cardDataFinalModelCopy)
labels=model.predict(cardDataFinalModelCopy)

print("Silhouette Score for Final KMeans model:", silhouette_score(cardDataFinalModelCopy, labels))

cardDataFinalModelCopy['Group'] = labels

cardDataFinalModelCopy.head()
Silhouette Score for Final KMeans model: 0.2723911235693706
Out[25]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Group
0 1.740187 -1.249225 -0.860451 -0.547490 -1.251537 1
1 0.410293 -0.787585 -1.473731 2.520519 1.891859 0
2 0.410293 1.058973 -0.860451 0.134290 0.145528 1
3 -0.121665 0.135694 -0.860451 -0.547490 0.145528 1
4 1.740187 0.597334 -1.473731 3.202298 -0.203739 4
In [26]:
# Scatter 3D Plot
px.scatter_3d(cardDataFinalModelCopy, x='Avg_Credit_Limit', y='Total_Credit_Cards', z='Group', color='Group')
In [27]:
# Box Plot
cardDataFinalModelCopy.boxplot(by='Group', figsize=(15,15));

Deliverable – 3

Hierarchical Clustering

In [28]:
# Data Preprocessing - Scaling

scaler = preprocessing.StandardScaler()
cardDataHierarchical = pd.DataFrame(scaler.fit_transform(cardDataTrimmed), columns=cardDataTrimmed.columns)
cardDataHierarchical.head()
Out[28]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1.740187 -1.249225 -0.860451 -0.547490 -1.251537
1 0.410293 -0.787585 -1.473731 2.520519 1.891859
2 0.410293 1.058973 -0.860451 0.134290 0.145528
3 -0.121665 0.135694 -0.860451 -0.547490 0.145528
4 1.740187 0.597334 -1.473731 3.202298 -0.203739
In [29]:
#linkageMethods=['single','complete','average','ward','median']

hierarchicalResults = pd.DataFrame(columns=['Linkage Method', 'Cophenetic Correlation', 'Dendrogram Distance For 5 Clusters',
                                           'Silhouette Score'])
In [30]:
def plotHierarchicalClustering(mtd, dataSet):
    linkageMatrix = linkage(dataSet, method=mtd, metric='euclidean')
    cophenetic_correlation, cophenetic_distances = cophenet(linkageMatrix, pdist(dataSet))
    print("Linkage Method:", mtd) 
    print("Cophenetic Correlation:", cophenetic_correlation)
    plt.figure(figsize=(25, 10))
    plt.title('Hierarchical Clustering Dendrogram - showing Top 15 clusters for Linkage Method "'+mtd+'"', 
              fontdict={'fontsize': 25})
    dendrogram(linkageMatrix, truncate_mode='lastp', p=15)
    # Search for the cut distance that yields exactly 5 clusters
    dendrogramDistance = 2
    clusters = fcluster(linkageMatrix, dendrogramDistance, criterion='distance')
    if(len(set(clusters)) < 5):
        while(len(set(clusters)) != 5):
            dendrogramDistance -= 0.01
            clusters = fcluster(linkageMatrix, dendrogramDistance, criterion='distance')
    elif(len(set(clusters)) > 5):
        while(len(set(clusters)) != 5):
            dendrogramDistance += 0.01
            clusters = fcluster(linkageMatrix, dendrogramDistance, criterion='distance')
    
    print("Dendrogram Distance where ", len(set(clusters)), " clusters found is:", dendrogramDistance)
    silhouetteScore = silhouette_score(dataSet, clusters)
    print("Silhouette Score:", silhouetteScore)
    dataSet['Group'] = clusters
    # Box Plot
    dataSet.boxplot(by='Group', figsize=(15,15));
    resultFrame = pd.Series([mtd, cophenetic_correlation, dendrogramDistance, silhouetteScore], index=hierarchicalResults.columns)
    return resultFrame
In [31]:
result = plotHierarchicalClustering('single', cardDataHierarchical.copy())
hierarchicalResults = pd.concat([hierarchicalResults, result.to_frame().T], ignore_index=True)
Linkage Method: single
Cophenetic Correlation: 0.7391220243806552
Dendrogram Distance where  5  clusters found is: 1.1199999999999992
Silhouette Score: 0.18361418805018428
In [32]:
result = plotHierarchicalClustering('complete', cardDataHierarchical.copy())
hierarchicalResults = pd.concat([hierarchicalResults, result.to_frame().T], ignore_index=True)
Linkage Method: complete
Cophenetic Correlation: 0.8599730607972423
Dendrogram Distance where  5  clusters found is: 3.9099999999999593
Silhouette Score: 0.2795859403678087
In [33]:
result = plotHierarchicalClustering('average', cardDataHierarchical.copy())
hierarchicalResults = pd.concat([hierarchicalResults, result.to_frame().T], ignore_index=True)
Linkage Method: average
Cophenetic Correlation: 0.8977080867389372
Dendrogram Distance where  5  clusters found is: 2.509999999999989
Silhouette Score: 0.44039753024783956
In [34]:
result = plotHierarchicalClustering('ward', cardDataHierarchical.copy())
hierarchicalResults = pd.concat([hierarchicalResults, result.to_frame().T], ignore_index=True)
Linkage Method: ward
Cophenetic Correlation: 0.7415156284827493
Dendrogram Distance where  5  clusters found is: 9.749999999999835
Silhouette Score: 0.2569177732675831
In [35]:
result = plotHierarchicalClustering('median', cardDataHierarchical.copy())
hierarchicalResults = pd.concat([hierarchicalResults, result.to_frame().T], ignore_index=True)
Linkage Method: median
Cophenetic Correlation: 0.8893799537016724
Dendrogram Distance where  5  clusters found is: 2.329999999999993
Silhouette Score: 0.4314757325455211
In [36]:
hierarchicalResults.sort_values(by='Cophenetic Correlation', ascending=False)
Out[36]:
Linkage Method Cophenetic Correlation Dendrogram Distance For 5 Clusters Silhouette Score
2 average 0.897708 2.51 0.440398
4 median 0.889380 2.33 0.431476
1 complete 0.859973 3.91 0.279586
3 ward 0.741516 9.75 0.256918
0 single 0.739122 1.12 0.183614

From the table above, the Average linkage method has the highest Cophenetic Correlation and Silhouette Score, so Average is the final linkage method for hierarchical clustering.
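The same five-cluster cut of the average-linkage tree can also be obtained with scikit-learn's `AgglomerativeClustering`, which takes a cluster count directly instead of searching for a distance threshold. A sketch on synthetic blobs showing the scipy and scikit-learn cuts agree (illustrative data, not the card dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: three well-separated 2-D blobs
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 3, 6)])

# scipy: cut the average-linkage tree into 3 clusters by count
scipy_labels = fcluster(linkage(data, method='average'), 3, criterion='maxclust')

# scikit-learn: same tree, same cut, requested as n_clusters
sk_labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(data)

# Adjusted Rand index is 1.0 when the two partitions are identical up to relabeling
print(adjusted_rand_score(scipy_labels, sk_labels))
```

Requesting `n_clusters=5` on `cardDataHierarchical` would sidestep the 0.01-step distance search used in `plotHierarchicalClustering` above.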


Deliverable – 4

Silhouette score for final models

In [37]:
# KMeans - from model finalized above

model=KMeans(n_clusters=5, random_state=random)
model.fit(cardDataKMeans)
labels=model.predict(cardDataKMeans)
print("Silhouette Score for Final KMeans model:", silhouette_score(cardDataKMeans, labels))
Silhouette Score for Final KMeans model: 0.2723911235693706
In [38]:
# Hierarchical Clustering - using the linkage method and dendrogram distance finalized above

linkageMatrix = linkage(cardDataHierarchical, method='average', metric='euclidean')
clusters = fcluster(linkageMatrix, 2.51, criterion='distance')
print("Silhouette Score for Final Hierarchical clustering:", silhouette_score(cardDataHierarchical, clusters))
Silhouette Score for Final Hierarchical clustering: 0.44039753024783956

Deliverable – 5

Analysis of the clusters formed

K-means Clusters v/s Hierarchical Clusters

In [39]:
# Assign groups on original trimmed data from final model labels in above steps.

cardDataCopyKMeans = cardDataTrimmed.copy()
cardDataCopyKMeans['Group'] = labels

cardDataCopyHierarchical = cardDataTrimmed.copy()
cardDataCopyHierarchical['Group'] = clusters

For the K-means clusters - find the statistical summary for each group formed and examine the distribution of each column across the groups.

In [40]:
cardDataCopyKMeans.groupby(by='Group').describe().T
Out[40]:
Group 0 1 2 3 4
Avg_Credit_Limit count 108.000000 196.000000 190.000000 116.000000 50.000000
mean 12509.259259 36260.204082 31226.315789 11862.068966 141040.000000
std 5762.423183 22430.945839 21659.127922 4725.318580 34457.877272
min 3000.000000 5000.000000 5000.000000 5000.000000 84000.000000
25% 8000.000000 16000.000000 12000.000000 8000.000000 108750.000000
50% 12000.000000 35000.000000 28000.000000 11000.000000 145500.000000
75% 16000.000000 56000.000000 47000.000000 16000.000000 170000.000000
max 50000.000000 100000.000000 75000.000000 20000.000000 200000.000000
Total_Credit_Cards count 108.000000 196.000000 190.000000 116.000000 50.000000
mean 2.435185 5.551020 5.478947 2.387931 8.740000
std 1.162282 1.142399 1.139516 1.044835 1.274715
min 1.000000 2.000000 4.000000 1.000000 5.000000
25% 1.000000 5.000000 4.000000 2.000000 8.000000
50% 2.000000 6.000000 6.000000 2.000000 9.000000
75% 4.000000 7.000000 6.000000 3.000000 10.000000
max 4.000000 7.000000 7.000000 4.000000 10.000000
Total_visits_bank count 108.000000 196.000000 190.000000 116.000000 50.000000
mean 0.768519 2.494898 4.515789 1.086207 0.600000
std 0.780684 0.531060 0.501071 0.797485 0.494872
min 0.000000 1.000000 4.000000 0.000000 0.000000
25% 0.000000 2.000000 4.000000 0.000000 0.000000
50% 1.000000 3.000000 5.000000 1.000000 1.000000
75% 1.000000 3.000000 5.000000 2.000000 1.000000
max 2.000000 3.000000 5.000000 2.000000 1.000000
Total_visits_online count 108.000000 196.000000 190.000000 116.000000 50.000000
mean 3.712963 0.954082 1.010526 3.405172 10.900000
std 1.260848 0.836926 0.878849 1.164537 2.880264
min 1.000000 0.000000 0.000000 1.000000 6.000000
25% 3.000000 0.000000 0.000000 2.000000 8.000000
50% 4.000000 1.000000 1.000000 3.500000 11.000000
75% 5.000000 2.000000 2.000000 4.000000 13.000000
max 10.000000 3.000000 2.000000 5.000000 15.000000
Total_calls_made count 108.000000 196.000000 190.000000 116.000000 50.000000
mean 8.592593 2.061224 1.936842 5.267241 1.080000
std 1.050474 1.452258 1.409045 1.129507 0.876915
min 7.000000 0.000000 0.000000 1.000000 0.000000
25% 8.000000 1.000000 1.000000 4.000000 0.000000
50% 9.000000 2.000000 2.000000 5.000000 1.000000
75% 9.000000 3.000000 3.000000 6.000000 2.000000
max 10.000000 4.000000 4.000000 7.000000 3.000000
In [41]:
cardDataCopyKMeans.groupby(by='Group')['Avg_Credit_Limit'].plot(kind='kde');
In [42]:
cardDataCopyKMeans.groupby(by='Group')['Total_Credit_Cards'].plot(kind='kde');
In [43]:
cardDataCopyKMeans.groupby(by='Group')['Total_visits_bank'].plot(kind='kde');
In [44]:
cardDataCopyKMeans.groupby(by='Group')['Total_visits_online'].plot(kind='kde');
In [45]:
cardDataCopyKMeans.groupby(by='Group')['Total_calls_made'].plot(kind='kde');

For the hierarchical clusters - find the statistical summary for each group formed and examine the distribution of each column across the groups.

In [46]:
cardDataCopyHierarchical.groupby(by='Group').describe().T
Out[46]:
Group 1 2 3 4 5
Avg_Credit_Limit count 50.000000 222.000000 1.0 386.000000 1.0
mean 141040.000000 12027.027027 50000.0 33541.450777 100000.0
std 34457.877272 4602.693407 NaN 21952.240769 NaN
min 84000.000000 3000.000000 50000.0 5000.000000 100000.0
25% 108750.000000 8000.000000 50000.0 13000.000000 100000.0
50% 145500.000000 12000.000000 50000.0 31000.000000 100000.0
75% 170000.000000 16000.000000 50000.0 51750.000000 100000.0
max 200000.000000 20000.000000 50000.0 75000.000000 100000.0
Total_Credit_Cards count 50.000000 222.000000 1.0 386.000000 1.0
mean 8.740000 2.400901 3.0 5.520725 2.0
std 1.274715 1.099815 NaN 1.128537 NaN
min 5.000000 1.000000 3.0 4.000000 2.0
25% 8.000000 1.000000 3.0 4.000000 2.0
50% 9.000000 2.000000 3.0 6.000000 2.0
75% 10.000000 3.000000 3.0 6.000000 2.0
max 10.000000 4.000000 3.0 7.000000 2.0
Total_visits_bank count 50.000000 222.000000 1.0 386.000000 1.0
mean 0.600000 0.932432 0.0 3.492228 1.0
std 0.494872 0.801544 NaN 1.131000 NaN
min 0.000000 0.000000 0.0 1.000000 1.0
25% 0.000000 0.000000 0.0 3.000000 1.0
50% 1.000000 1.000000 0.0 3.000000 1.0
75% 1.000000 2.000000 0.0 5.000000 1.0
max 1.000000 2.000000 0.0 5.000000 1.0
Total_visits_online count 50.000000 222.000000 1.0 386.000000 1.0
mean 10.900000 3.531532 10.0 0.984456 1.0
std 2.880264 1.140134 NaN 0.858732 NaN
min 6.000000 1.000000 10.0 0.000000 1.0
25% 8.000000 2.250000 10.0 0.000000 1.0
50% 11.000000 4.000000 10.0 1.000000 1.0
75% 13.000000 4.000000 10.0 2.000000 1.0
max 15.000000 5.000000 10.0 3.000000 1.0
Total_calls_made count 50.000000 222.000000 1.0 386.000000 1.0
mean 1.080000 6.873874 9.0 2.010363 0.0
std 0.876915 1.984634 NaN 1.430610 NaN
min 0.000000 1.000000 9.0 0.000000 0.0
25% 0.000000 5.000000 9.0 1.000000 0.0
50% 1.000000 7.000000 9.0 2.000000 0.0
75% 2.000000 9.000000 9.0 3.000000 0.0
max 3.000000 10.000000 9.0 4.000000 0.0
In [47]:
# Exclude groups 3 and 5 (one record each); KDE plots require more than one record per group.
df_For_Graph = cardDataCopyHierarchical[(cardDataCopyHierarchical["Group"] != 3) & (cardDataCopyHierarchical["Group"] != 5)]
In [48]:
df_For_Graph.groupby(by='Group')['Avg_Credit_Limit'].plot(kind='kde');
In [49]:
df_For_Graph.groupby(by='Group')['Total_Credit_Cards'].plot(kind='kde');
In [50]:
df_For_Graph.groupby(by='Group')['Total_visits_bank'].plot(kind='kde');
In [51]:
df_For_Graph.groupby(by='Group')['Total_visits_online'].plot(kind='kde');
In [52]:
df_For_Graph.groupby(by='Group')['Total_calls_made'].plot(kind='kde');

Observations

  • In K-means, groups 0 and 3 are similar in size, as are groups 1 and 2. In hierarchical clustering the group sizes are clearly distinct, suggesting it separates the groups more decisively.
  • In hierarchical clustering, groups 3 and 5 contain only one record each, so it also isolates outliers.
  • The per-column distributions overlap less across the hierarchical clusters than across the K-means clusters.
  • Although hierarchical clustering gives better results here (higher silhouette score), its computational cost should be kept in mind when deciding whether to use it, especially on larger datasets.
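One concrete way to quantify how the two labelings line up is a cross-tabulation of the group columns (in this notebook that would be `pd.crosstab(cardDataCopyKMeans['Group'], cardDataCopyHierarchical['Group'])`). A self-contained sketch with hypothetical label vectors:

```python
import numpy as np
import pandas as pd

# Hypothetical label vectors standing in for the K-means and hierarchical
# group assignments computed above (one entry per record).
kmeans_groups = np.array([0, 0, 1, 1, 2, 2, 2, 0])
hier_groups = np.array([2, 2, 1, 1, 3, 3, 3, 2])

# Rows: K-means groups; columns: hierarchical groups; cells: shared records.
# A clean diagonal-like pattern means the two methods largely agree.
overlap = pd.crosstab(kmeans_groups, hier_groups, rownames=['kmeans'], colnames=['hier'])
print(overlap)
```

Cells far off any dominant row/column pattern mark records the two methods assign differently, which are worth inspecting individually.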

Deliverable – 6

Answers to Key questions

As hierarchical clustering proved best in this case, the answers below are based on the hierarchical clusters, excluding the outlier groups 3 and 5.

  1. How many different segments of customers are there?

    There are 3 different segments of customers.

  2. How are these segments different from each other?

    • Customers with a high credit limit (group 1)
      • Hold 5 to 10 credit cards.
      • Rarely or never visit the bank.
      • Prefer online visits.
      • Rarely or never call the bank.

    • Customers with a moderate credit limit (group 4)
      • Hold 4 to 7 credit cards.
      • Visit the bank at least once.
      • Rarely or never visit online.
      • Rarely or never call the bank.

    • Customers with a low credit limit (group 2)
      • Hold 1 to 4 credit cards.
      • Rarely or never visit the bank.
      • Visit online at least once.
      • Call the bank 5 times or more.



  3. What are your recommendations to the bank on how to better market to and service these customers?

    • High-credit-limit customers are active online, so reach them through e-advertising, emails, and voice/text messages.
    • Moderate-credit-limit customers visit the bank at least once, so engage them during branch visits: share information about schemes, answer their queries, and convert the opportunities.
    • Low-credit-limit customers call often and also visit online at least once, so explain bank schemes during calls, or play pre-recorded messages while they wait to connect to a customer-care executive, and also reach them through e-advertising, emails, and voice/text messages.

